I have been increasingly surprised over the past year at what I perceive to be the lack of critical thinking in response to new announcements regarding AI model capabilities. I’ve written about many of these issues before, but as AI hype accelerates the problems seem to be getting worse. Here I want to highlight a few points that seem to be neglected in coverage of the growth of AI capabilities.
Advertising is treated as fact
Every new model release by a major tech company represents billions of dollars of expenditure and the potential for billions of dollars of additional investment. Model releases and their associated documentation should therefore be interpreted as advertising designed primarily to attract customers and investors. Their goal is not to provide an objective and thorough analysis of the strengths and weaknesses of each new model. Of course, this does not mean that everything reported is false, but it does mean that such releases should be treated with significant skepticism. It is therefore disappointing to me to see write-ups such as this one by Rob Wiblin, which consists almost entirely of uncritical restatement of claims made by Anthropic, with little to no additional analysis.
For example, in discussing the capabilities of Claude Mythos to identify software vulnerabilities, Wiblin says:
“Anthropic’s previous model Opus 4.6 could only successfully convert a bug it identified in the browser Firefox into an effective way to accomplish something really bad 1% of the time. Mythos could do it 72% of the time.”
This statement is selective and misleading. What Anthropic actually did was provide their models with a testing harness that mimicked Firefox 147 but lacked critical defence components. They then prompted the models to devise and implement a certain type of exploit. Mythos fully accomplished this 72% of the time, but nearly always did so using two specific bugs that have since been fixed. When Anthropic removed these two bugs, Mythos fully succeeded only 4.4% of the time. (It was unclear to me whether Mythos had knowledge of these bugs from its training data, given that they had already been fixed.) While I do not doubt that Mythos has improved capabilities relative to previous models, the reporting here unjustifiably hypes the significance of the results without providing any substantive critical analysis. This is further highlighted by the fact that an independent analysis was able to find many of the same vulnerabilities using much smaller open-source models.
Wiblin also extrapolates well beyond what is even claimed by Anthropic, such as when he argues:
“Now, Anthropic doesn’t say this directly in their reports, but I think a common-sense interpretation of the above is that in any deployment where this AI has access to the kind of tools that would make it actually useful to people — the ability [to] access some parts of the network and execute code — [it] could probably break out of whatever software box we try to put it in, because the systems that we would be trying to restrain it [with] are themselves made of software, and that software is going to have vulnerabilities nobody knows about that this model is superhumanly good at finding and taking advantage of”
In my view, it is reasonable to think that humans armed with improved automated techniques for identifying software vulnerabilities would be better, rather than worse, at constraining the behaviour of new models. This is in fact what Anthropic argues in their report. There may be differences of opinion about this, but this is an example of where hype seems to be substituting for genuine analysis.
Wiblin also comments on the fact that Anthropic has yet to release Mythos publicly:
“And also keep in mind that on Monday — the day before Anthropic published all of this — we learned that their annualised revenue run rate had grown from $9 billion at the end of December to $30 billion just three months later…
That exploding revenue is a pretty good proxy for how much more useful the previous release, Opus 4.6, has become for real-world tasks. If the past relationship between capability measures and usefulness continues to hold, the economic impact of Mythos once it becomes available is going to dwarf everything that came before it — which is part of why Anthropic’s decision not to release it is a serious one, and actually quite a costly one for them.
They’re sitting on something that would likely push their revenue run rate into the hundreds of billions, but they’ve decided it’s simply not worth the risk.”
Wiblin does not consider the possibility that Anthropic is publishing these claims without the corresponding model in order to build hype for a model that is not actually ready for release, especially in the lead-up to Anthropic’s upcoming IPO. Wiblin does not explain where his estimate of ‘hundreds of billions of dollars’ of revenue comes from, but it reads to me like pure marketing for potential investors. Nor does it make sense to claim that revenue is a measure of economic value when Anthropic, OpenAI, and others are massively subsidising usage. There is a discussion to be had about the implications of these issues, but it is not to be found in this piece (or similar ones I’ve seen on the subject). We need to do better than uncritically repeating the advertising talking points of billion-dollar tech companies.
Benchmarks are interpreted uncritically
Much of the claimed improvement in performance of models derives from rapidly increasing scores on various benchmarks, which are standardised tests designed to quantify model capabilities on tasks such as language, coding, reasoning, and image recognition. While these benchmark scores give the appearance of precise and objective tests, in practice they often have very limited value in assessing the rate of capability improvement in a meaningful way.
First, most benchmarks have not been validated. Validity is an important concept in research generally, and especially in human psychometrics: it refers to the extent to which a metric has been shown to adequately measure the underlying phenomenon of interest. There are many components of validity, and validity assessments require careful research into the relationship between test performance and the target phenomenon. Few AI benchmarks report this sort of research; most simply present tasks the researchers hope are related to the target capability. This is poor research practice. Whether a given set of tasks provides reliable and valid information about the capability of interest cannot be determined by intuition; it requires carefully designed research.
Second, almost as soon as benchmarks are released, their solutions begin to contaminate the training data of new models. For instance, memorisation is known to be a major problem for SWE-Bench, a widely used benchmark of software engineering tasks. A recent analysis of visual benchmarks found that models could outperform humans on a standard X-ray question-answering benchmark without being provided with any images at all. A particularly concerning analysis found it was possible to achieve 100% on several major benchmarks without solving a single task, usually by exploiting simple vulnerabilities in the test pipeline or the way scores are computed.
Third, even when the test solutions are not publicly available, the training data often resemble the test data closely enough that a model trained on the training data will show dramatically improved performance on the test questions as well. This would not be a problem if the train and test problems constituted a representative sample of the domain of interest, but for many important domains (language, reasoning, coding, image recognition), the space is so vast and hard to characterise that it is not possible to construct a representative sample in this way. Sampling also tends to favour more common and simpler problems, and even very subtle changes in the sampling method can lead to the model learning radically different representations. This means that models tend to overfit to the training data, diminishing the value of benchmarks in assessing out-of-distribution generalisation capabilities.
The issue of benchmark contamination is granted only a few of the 244 pages in the Mythos model card. Only a few benchmarks are assessed for contamination, with Anthropic arguing that most of the improvement on these cannot be attributed to memorisation. However, their results show that model performance degrades significantly when restricted to the 20% of benchmark questions they assess as having the lowest probability of memorisation. This was true even for the SWE-Bench Pro benchmark, which is supposedly ‘a contamination-resistant testbed’. This highlights the importance of devoting more attention to these issues in order to better interpret the meaning of benchmark improvements.
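To make the contamination issue concrete, one common screening approach is to look for verbatim n-gram overlap between benchmark items and training text. The sketch below is purely illustrative; the function names, the n-gram size, and the toy data are my own assumptions, not any lab's actual contamination pipeline (which typically operates at far larger scale and with more sophisticated matching).

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item, training_text, n=8):
    """Fraction of the item's n-grams that also appear in the training text.

    A high score suggests the item (or its solution) may have been seen
    during training, so a correct answer could reflect memorisation
    rather than capability.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(training_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# An item copied verbatim into the training data scores 1.0,
# while unrelated text scores near 0.
item = "what is the capital of france and when was it founded"
corpus = "trivia dump: what is the capital of france and when was it founded plus more text"
print(contamination_score(item, corpus, n=4))  # → 1.0
```

The deeper point in the paragraphs above is that passing this kind of verbatim check is not sufficient: near-duplicates and structurally similar training problems can inflate scores without any exact n-gram match, which is why restricting evaluation to low-memorisation-probability subsets (as in the model card analysis) is a useful complementary test.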
Negative results are ignored
I rarely see discussion in EA circles of various results which indicate fundamental limitations of existing LLM-based approaches. Numerous studies have found that these models often fail to learn the appropriate task structure, but instead learn to answer questions by learning spurious correlations and superficial heuristics that work in some constrained domain or training task, but do not generalise to variations of the task. There are also significant known limitations of the chain of thought approach which underpins reasoning models, with thought chains often being unfaithful to the actual computations that generate model predictions. In my view, there is reason to believe that these problems reflect fundamental limitations of the machine learning techniques that underpin leading models.
As a further interesting example, Claude Opus 4.7 shows a significant regression in performance on long context tasks based on the MRCR benchmark, which interestingly is precisely a benchmark that uses adversarial methods to distract the model from the task. Anthropic’s response is:
“We kept MRCR in the system card for scientific honesty, but we've actually been phasing it out slowly. It's built around stacking distractors to trick the model, which isn't how people actually use long context.”
In my view, this response indicates that Anthropic is more interested in ensuring their model works in typical use cases, rather than assessing whether it actually has robust generalisable capabilities that are indicative of what we might call ‘genuine intelligence’. This is particularly relevant for arguments relying on extrapolations of improvements in model capabilities to novel tasks and in more complex settings.
Conclusions
There is no doubt that LLM-based models have shown significant improvements in recent years. However, it is important to carefully and critically assess these advances in order to make accurate inferences about their social, political, and economic impacts. One cannot infer AI 2027-like superintelligence takeover scenarios from recent trends and developments without making significant additional assumptions about the nature of generalized intelligence, the relevance of benchmark results, and the limitations of LLM-based models. Humans have a very bad track record of predicting what tasks require ‘general intelligence’ to accomplish, and I suspect that it may be possible to develop machine learning models that can automatically perform any task with known solutions without this implying any superintelligence takeoff. These issues are complex and demand a more nuanced, informed consideration than I often see in contemporary discussions.

......

You quote him as observing that their revenue tripled over the past 3 months, and some basic math tells us that another ~tripling gets them to $100B.
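Spelled out, the arithmetic is just a one-line extrapolation from the quoted figures; the assumption that last quarter's growth multiple repeats unchanged is doing all the work:

```python
# Run-rate figures quoted in the post; the projection simply assumes
# the previous quarter's growth multiple repeats, which is a strong
# assumption rather than a forecast.
start = 9e9      # ~$9B annualised run rate, end of December
current = 30e9   # ~$30B three months later
growth = current / start            # ~3.3x in one quarter
next_quarter = current * growth     # ~$100B if the trend simply continues
print(f"${next_quarter / 1e9:.0f}B")  # → $100B
```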
I'm in favor of rigor and would also have preferred him to share a more detailed model, but "pure marketing for potential investors" seems like an unfair characterization of a "predict trends will continue unchanged" forecast.
Given that you are criticising the epistemics of EAs taking AGI very seriously, I think it's reasonable to hold this post to a higher epistemic standard than a typical EA forum post. Apologies if this comes across as combative - I spent some time trying to tone it down with Claude and struggled to get something that wasn't just hedged/weak sauce. I am excited about more discussion of the capabilities of AI systems on the EA forum and would like more people to write up their takes on the current situation.
......
I think you are applying more rigour to the bullish case than the bearish one. For example, you say:
I think this is misleading for a few reasons:
On the claim that Anthropic talks about risks from their own models primarily to create hype: I find this hard to square with the evidence. Talking about how your B2B product might be extremely dangerous, or publishing lengthy documents critically assessing your own product and admitting to errors that would be difficult to identify independently (e.g. accidentally training against the CoT), is not a common marketing tactic. It feels like your model implies that companies should only release materials optimised for short-term interests, which doesn't predict the real differences in how AI companies approach releases.
Benchmarks are interpreted uncritically
The benchmark contamination arguments are worth engaging with in principle, but I'm not sure they're doing much work in practice - I don't think many people in EA are actually updating heavily on raw benchmark scores right now. METR, arguably EA's favourite benchmarking org, has been pretty vocal about their own benchmarks being saturated, so I think the community is reasonably aware of these limitations already.
Negative results are ignored
I'm genuinely uncertain what you want Anthropic and other AI companies to do here. Do you think "genuine intelligence" is easy to measure and well-defined? The more concrete concepts being used as proxies - coding ability, economic value generated, uplift - seem defensible on their own terms rather than as misleading substitutes for something more fundamental.
On "fundamental limits of LLMs" more broadly: these arguments have been made confidently by prominent researchers since the advent of LLMs and have not had a great track record. That doesn't make them wrong, but it's worth noting.
......
I think this post would be much stronger if it applied its standards more symmetrically. It would also help to have a more concrete conclusion. The current takeaway is essentially "further research is needed", which is a claim you can make about most areas of research (so much so that it's been banned from multiple journals), but I don't have a great sense of what research would actually convince you that the "AI hype" is reasonable.
Credulous really is the right word. There is a strand of dialogue in EA circles that feels like “we called much of this many years ago”, therefore “everything that transpires will mimic our thought experiments perfectly.” The marketing from frontier labs is the offspring of early EA/LW ideas. The potential for confirmation bias here is astronomical.
We should expect to get constantly nerdsniped by frontier labs. And we have. Most EAs I talk to think Claude Code has made (or nearly made) software engineering a closed loop of RSI. They see the METR graph as a direct line pointing to AGI. They see AI 2027 as a principled, ballpark estimate for encroaching doom.
More skepticism and more posts like this seem incredibly important.
Thank you for writing this. I do not agree with either of the criticisms expressed in the other comments. It is clear to me from the title of this article that the point is that more skepticism is appropriate towards the materials published by major AI laboratories, and then the article justifies this by outlining data that is problematic for a naïve interpretation of major lab press publications.
I do not agree with dismissing the writeup by AISLE. They have been publicly doing this work and writing about it for some time, and in the write-up they are hardly baselessly critical of Anthropic. Their fundamental point, which is backed up by their own results in their article and other writings, is that the success of models at cybersecurity tasks is largely the result of a larger apparatus around the models. We see similar things with agentic coding, where the harness matters as much to the actual utility as the specific model.
On the financial side, I agree that EAs should take a more critical stance regarding the financial circumstances of major AI labs. These labs are racing to IPO. The underlying economics of the AI industry are well known to be problematic. You don't have to go full Zitron to see that the financial picture is more complicated than can be inferred from just charting Anthropic's reported ARR growth.
I work with AI every day as a software engineer. I'm not some sort of luddite, but precisely because of my experience as a consumer of the technology, it is impossible not to notice the marketing hype cycle that has come to engulf the industry. Probably the dominant category of ads I personally see on Facebook now is ads for coding harnesses from OpenAI and Anthropic. Anyone who peruses the relevant subreddits is used to seeing a flood of astroturfed threads intended to sway readers' loyalties as customers from one to the other. These companies are spending incredible sums of money to market their products, and that should inform how we approach claims made by company figureheads. I still recall the way my stomach churned about a year ago now, maybe a month after the release of Deep Research, when Sam Altman, asked what he does in his free time, responded that of course he doesn't have any free time, but if he did, he would spend it all day reading Deep Research reports, or something to that effect. For me, that moment broke the fourth wall. He was obviously being disingenuous, and so how was I to interpret everything else he had said, which I had been happily nodding along to up until that point?
Doubtless many examples could be added to the OP, but I will satisfy myself with just one. One of the earliest sources of information about Mythos was actually the Claude Code source leak, and one thing we learned from that leak is that the quality of code being generated internally at Anthropic is incredibly low. It is not difficult to find numerous reviews of the Claude Code source tearing it apart for the low quality of craftsmanship and the bugginess of the code therein (links here, here, commentary on the former here). How does that update your priors on the idea that Mythos is a huge leap forward in terms of cybersecurity capabilities? Doubtless there is some sort of way to harmonize the two—and to be clear, I do expect Mythos to be an improvement—but is it possible that current model capabilities are being overstated by an organization pumping itself before an IPO?
None of this is to say that we shouldn't be concerned about AGI. Nor is the point of the OP, as I read it, that we shouldn't take AGI seriously. It is that it is aggravating to see so many people in EA circles uncritically accept and repeat claims by major AI labs that seem quite dubious. I actually don't see why skepticism of major laboratory pronouncements should have any bearing on our stance on x-risk and AGI, other than that it should cause us to distrust said labs and be more willing to do our own homework. Furthermore, I'm not saying that model capabilities aren't advanced; I barely ever write code by hand nowadays. Again, I took the point of the OP's article (and I agree with it) to be that statements by major labs about model capabilities should not be taken as straightforward recitations of objective truth. They are embedded in a highly competitive context involving competition for vast sums of money and huge numbers of users, and they are intended to influence that context, including, yes, by scaring people into buying a subscription. The OP is attempting to help others see this possibility by providing additional data and argumentation that would be hard to account for if things were as straightforward as major lab publications suggest.